Advanced Graphics and Data Visualization in R

Lecture 02: ggplot2 and choosing the right visualization for your audience

0.1.0 An overview of Advanced Graphics and Data Visualization in R

“Advanced Graphics and Data Visualization in R” is brought to you by the Centre for the Analysis of Genome Evolution & Function’s (CAGEF) bioinformatics training initiative. This course (CSB1021) was developed to enhance the skills of students with basic backgrounds in R by focusing on available philosophies, methods, and packages for plotting scientific data. While the datasets and examples used in this course are centred on SARS-CoV-2 epidemiological and genomic data, the lessons learned herein will be broadly applicable.

This lesson is the second in a 6-part series. The aim is that, by the end of the series, students will be able to import, format, and display data based on their intended message and audience. The format and style of these visualizations will help to identify and convey the key message(s) from their experimental data.

The structure of the class is a code-along style in R markdown notebooks. At the start of each lecture, skeleton versions of the lecture will be provided for use on the University of Toronto datatools Hub so students can program along with the instructor.


0.2.0 Lecture objectives

This week we will delve into some of the helpful visualizations available through the ggplot2 package. First we’ll go over deciding which visualizations best accomplish our goals, and then we’ll explore some of them in more detail.

At the end of this lecture you will have covered the following topics:

  1. The grammar of graphics.
  2. Scatterplots and their variants.
  3. Barplots and their variants.
  4. Density plots and their variants.
  5. Categorical plots and their variants.

0.3.0 A legend for text format in R markdown

grey background - a package, function, code, command or directory. Backticks are also used for in-line code.
italics - an important term or concept or an individual file or folder
bold - heading or a term that is being defined
blue text - named or unnamed hyperlink

... - Within each coding cell this will indicate an area of code that students will need to complete for the code cell to run correctly.

Blue box: A key concept that is being introduced

Yellow box: Risk or caution

Green boxes: Recommended reads and resources to learn R

Red boxes: A comprehension question which may or may not involve a coding cell. You will usually find these at the end of a section.


0.4.0 Lecture and data files used in this course

0.4.1 Weekly Lecture and skeleton files

Each week, new lesson files will appear within your RStudio folders. We are pulling from a GitHub repository using this Repository git-pull link. Simply click on the link and it will take you to the University of Toronto datatools Hub. You will need to use your UTORid credentials to complete the login process. From there you will find each week’s lecture files in the directory /2024-03-Adv_Graphics_R/Lecture_XX. You will find a partially coded skeleton.Rmd file as well as all of the data files necessary to run the week’s lecture.

Alternatively, you can download the R-Markdown Notebook (.Rmd) and data files from the RStudio server to your personal computer if you would like to run the lecture independently of the datatools Hub.

0.4.2 Live-coding HTML page

A live lecture version will be available at camok.github.io that will update as the lecture progresses. Be sure to refresh to take a look if you get lost!

0.4.3 Post-lecture PDFs

As mentioned above, at the end of each lecture there will be a completed version of the lecture code released as a PDF file under the Modules section of Quercus.

0.4.4 Data used in this lesson

Today’s datasets will focus on some more Ontario public health unit data. We’ll deep dive a bit more into the demographics of SARS-CoV-2 infection, hospitalization, and survival.

0.4.4.1 Dataset 1: COVID-19_map_data_220303.csv

Retrieved from the open data sets available at Public Health Ontario, this data provides summary statistic information for all PHUs in Ontario including the population size for each PHU.

0.4.4.2 Dataset 2: Ontario_daily_change_in_cases_by_phu_long.tsv

This data originated from the Ontario Provincial government but was tidied during lecture 01 to reflect a long-format set of data compatible with our needs.

0.4.4.3 Dataset 3: Ontario_age_sex_COVID-19_cases.csv

Retrieved from the open data sets available at Public Health Ontario, this dataset focuses on summarizing PHU data across age and sex groups.


0.5.0 Packages used in this lesson

tidyverse which has a number of packages including dplyr, tidyr, stringr, forcats and ggplot2

viridis helps to create colour-blind-friendly palettes for our data visualizations

lubridate and zoo are helper packages used for working with date formats in R

ggbeeswarm, ggridges and GGally which are new packages we need to install. They’ll help us generate some nice graphs.

# ---------- Install packages here ---------- #
# install.packages("ggbeeswarm", dependencies = TRUE)
# install.packages("ggridges")
# install.packages("GGally")
# ---------- Source packages here ---------- #
# Packages to help tidy our data
library(tidyverse)

# Packages for the graphical analysis section
library(viridis)

# New visualisation packages
library(ggridges) # ridgeline plots
library(GGally) # Parallel coordinate plots
library(ggbeeswarm)

# packages used for working with/formating dates in R
library(lubridate) 
library(zoo)

1.0.0 Why is data visualization important?

How do we extract meaningful insights from our data? If you have previously explored Anscombe’s quartet you’ll know that, as scientists and lay people, we can sometimes be obsessed with summary statistics - mean, median, mode, standard deviation. While these values are a helpful way to quickly assess a population, they can be flawed to the point of deception. Instead, we should temper our investigations by visualizing our data. A deeper understanding of your data trends and potential models comes from dissecting attributes of your data, which can jump out more easily through visualization.

Equally important is the ability to convey our findings to others. The right visualizations, whether simplistic or complex, should effectively communicate our key message.


1.1.0 The grammar of graphics

One approach to effective data visualization relies on the Grammar of Graphics framework originally proposed by Leland Wilkinson (1999; 2nd edition 2005). The idea of grammar can be summarized as follows:

  • Grammar is the foundational set of rules that define the components of a language.
  • A language is built on a structure that consists of syntax and semantics.

The grammar of graphics facilitates a concise description of the components of any graphic. Hadley Wickham of tidyverse fame has proposed a variant on this concept - the layered grammar of graphics framework. By following a layered approach of defined components, we can build a visualization piece by piece.

The Major Components of the Grammar of Graphics by Dipanjan Sarkar

We can break down the above pyramid by the base components, building from the bottom upwards.

1. Data: your visualization always starts here. What are the dimensions you want to visualize? What aspect of your data are you trying to convey?

2. Aesthetics: assign your axes based on the data dimensions you have chosen. Where will the majority of the data fall on your plot? Are there other dimensions (such as categorically encoded groupings) that can be conveyed by aspects like size, shape, colour, or fill?

3. Scale: do you need to scale/transform any values to fit your data within a range? This includes layers that map between the data and the aesthetics.

4. Geometric objects: how will you display your data within your visualization? Which geom_* will you use?

5. Statistics: are there additional summary statistics that should be included in the visualization? Some examples include central tendency, spread, confidence intervals, standard error, etc.

6. Facets: will generating subplots of the data add a dimension to our visualization that would otherwise be lost?

7. Coordinate system: will your visualization follow a classic cartesian, semi-log, polar, etc. coordinate system?
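The seven components above map directly onto layers of a ggplot2 call. As a sketch only (using hypothetical placeholder names my_data, x_var, y_var and group_var):

```r
# A sketch mapping the grammar-of-graphics components onto ggplot2 layers.
# my_data, x_var, y_var and group_var are hypothetical placeholders.
my_data %>%
  # 1. Data
  ggplot() +
      # 2. Aesthetics: map data dimensions to visual properties
      aes(x = x_var, y = y_var, colour = group_var) +
      # 3. Scale: place the x-axis on a log10 scale
      scale_x_log10() +
      # 4. Geometric objects: display the data as points
      geom_point() +
      # 5. Statistics: add a linear trend line with its confidence interval
      geom_smooth(method = "lm") +
      # 6. Facets: one subplot per group
      facet_wrap(~ group_var) +
      # 7. Coordinate system: the default cartesian system, stated explicitly
      coord_cartesian()
```

We will build up real versions of this layered pattern throughout the lecture.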


1.2.0 Cumulative caseload data by PHU

Let’s look at a summarized data set this week with cumulative case counts across all PHUs in Ontario. What is nice about this data set is that it also carries some population data with it to give us a sense of proportions. This data comes from a period where cumulative case counts still had some meaning to them (March 2022) so we will use it to help demonstrate some of the information we want to visualize.

Steps we’ll take in working with this data:

  1. Open up the file data/COVID-19_map_data_220303.csv with read_csv()
  2. Adjust the column names
  3. Drop the “recent” data but keep the cumulative data

Let’s open that up with our friend read_csv()

# Read in PHU_population_information.csv
phu_information.df <- read_csv("data/COVID-19_map_data_220303.csv")
## Rows: 35 Columns: 19
## -- Column specification -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
## Delimiter: ","
## chr  (1): Geographic area
## dbl (18): Recent Case Count, Recent Case Rate, Cumulative Case Count, Cumula...
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Check the structure and preview the data
str(phu_information.df)
## spc_tbl_ [35 x 19] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Geographic area                                     : chr [1:35] "Ontario" "Algoma Public Health" "Brant County Health Unit" "Chatham-Kent Public Health" ...
##  $ Recent Case Count                                   : num [1:35] 25751 733 233 322 876 ...
##  $ Recent Case Rate                                    : num [1:35] 175 622 152 302 151 ...
##  $ Cumulative Case Count                               : num [1:35] 1107408 5275 9908 6924 48526 ...
##  $ Cumulative Case Rate                                : num [1:35] 7516 4476 6452 6494 8342 ...
##  $ Cumulative Hospitalizations Count                   : num [1:35] 42132 206 369 278 2252 ...
##  $ Cumulative Hospitalizations Rate                    : num [1:35] 286 175 240 261 387 ...
##  $ Cumulative Deaths Count                             : num [1:35] 12497 33 76 72 523 ...
##  $ Cumulative Deaths Rate                              : num [1:35] 84.8 28 49.5 67.5 89.9 62.3 98.2 28.4 70.8 49.8 ...
##  $ At least one dose count                             : num [1:35] 12489709 97881 122613 85128 474994 ...
##  $ Completed primary series count                      : num [1:35] 11947990 93941 117653 81743 454665 ...
##  $ At least one dose coverage (%)                      : num [1:35] 84.8 83.1 79.8 79.8 81.7 84.3 81.7 76.6 76.5 82.6 ...
##  $ Completed primary series coverage (%)               : num [1:35] 81.1 79.7 76.6 76.7 78.2 81.1 78.1 73.7 73.8 79.6 ...
##  $ At least one dose count among eligible              : num [1:35] 12486011 97873 122608 85120 474910 ...
##  $ Completed primary series count among eligible       : num [1:35] 11946134 93935 117650 81737 454628 ...
##  $ At least one dose coverage (%) among eligible       : num [1:35] 89.1 86.7 84.6 83.8 86 89.3 85.9 80.8 80.4 86 ...
##  $ Completed primary series coverage (%) among eligible: num [1:35] 85.3 83.2 81.2 80.5 82.4 85.9 82.2 77.7 77.6 82.9 ...
##  $ Population                                          : num [1:35] 14734014 117840 153558 106620 581722 ...
##  $ Eligible population                                 : num [1:35] 14010998 112839 144945 101565 552042 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   `Geographic area` = col_character(),
##   ..   `Recent Case Count` = col_double(),
##   ..   `Recent Case Rate` = col_double(),
##   ..   `Cumulative Case Count` = col_double(),
##   ..   `Cumulative Case Rate` = col_double(),
##   ..   `Cumulative Hospitalizations Count` = col_double(),
##   ..   `Cumulative Hospitalizations Rate` = col_double(),
##   ..   `Cumulative Deaths Count` = col_double(),
##   ..   `Cumulative Deaths Rate` = col_double(),
##   ..   `At least one dose count` = col_double(),
##   ..   `Completed primary series count` = col_double(),
##   ..   `At least one dose coverage (%)` = col_double(),
##   ..   `Completed primary series coverage (%)` = col_double(),
##   ..   `At least one dose count among eligible` = col_double(),
##   ..   `Completed primary series count among eligible` = col_double(),
##   ..   `At least one dose coverage (%) among eligible` = col_double(),
##   ..   `Completed primary series coverage (%) among eligible` = col_double(),
##   ..   Population = col_double(),
##   ..   `Eligible population` = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
head(phu_information.df)

Looking at the data itself, we want to perform the following wrangling actions:

1. Rename the variable names to lower case.

2. In the variable names, replace all of the spaces with _ characters.

3. Update the values in the Geographic area variable to remove excess information.

4. Drop the variables Recent Case Count and Recent Case Rate.

5. Drop all of the vaccination variables (containing the words “Completed” or “dose”). We’ll use a selection helper matches() to accomplish this.

6. Remove the prefix word “Cumulative” from any variable names.

7. Rename the Geographic area variable to public_health_unit.

unique(phu_information.df$'Geographic area')
##  [1] "Ontario"                                                 
##  [2] "Algoma Public Health"                                    
##  [3] "Brant County Health Unit"                                
##  [4] "Chatham-Kent Public Health"                              
##  [5] "City of Hamilton Public Health Services"                 
##  [6] "Durham Region Health Department"                         
##  [7] "Eastern Ontario Health Unit"                             
##  [8] "Grey Bruce Health Unit"                                  
##  [9] "Haldimand-Norfolk Health Unit"                           
## [10] "Haliburton, Kawartha, Pine Ridge District Health Unit"   
## [11] "Halton Region Public Health"                             
## [12] "Hastings Prince Edward Public Health"                    
## [13] "Huron Perth Health Unit"                                 
## [14] "Kingston, Frontenac and Lennox & Addington Public Health"
## [15] "Lambton Public Health"                                   
## [16] "Leeds, Grenville & Lanark District Health Unit"          
## [17] "Middlesex-London Health Unit"                            
## [18] "Niagara Region Public Health"                            
## [19] "North Bay Parry Sound District Health Unit"              
## [20] "Northwestern Health Unit"                                
## [21] "Ottawa Public Health"                                    
## [22] "Peel Public Health"                                      
## [23] "Peterborough Public Health"                              
## [24] "Porcupine Health Unit"                                   
## [25] "Public Health Sudbury & Districts"                       
## [26] "Region of Waterloo Public Health and Emergency Services" 
## [27] "Renfrew County and District Health Unit"                 
## [28] "Simcoe Muskoka District Health Unit"                     
## [29] "Southwestern Public Health"                              
## [30] "Thunder Bay District Health Unit"                        
## [31] "Timiskaming Health Unit"                                 
## [32] "Toronto Public Health"                                   
## [33] "Wellington-Dufferin-Guelph Public Health"                
## [34] "Windsor-Essex County Health Unit"                        
## [35] "York Region Public Health"
# Take our PHU information and tidy it up a bit

phu_information.df <- phu_information.df %>% 

  # Rename variables to lower case
  rename_with(str_to_lower) %>% 
  
  # Replace variable name spaces with _
  rename_with(str_replace_all, pattern=r"(\s)", replacement="_") %>% 
  
  # Rename the values in geographic_area
  mutate(geographic_area = str_remove_all(.$geographic_area, 
      pattern=r"(^Public\sHealth\s|\sPublic.*|\sand\sDistrict\s.*|\sDistrict\s.*|\sHealth.*)"                                        
      )) %>% 
  
  # Drop the "Recent" data 
  select(-2, -3) %>% 
  
  # Remove any variables containing the words "Completed" or "dose"
  select(-matches("Completed|dose")) %>% 
  
  # Rename the "Cumulative" variables
  rename_with(str_replace_all, pattern="cumulative_", replacement = "") %>% 
  
  # Rename "geographic_area" to "public_health_unit"
  rename(public_health_unit = geographic_area)

head(phu_information.df)
str(phu_information.df)
## tibble [35 x 9] (S3: tbl_df/tbl/data.frame)
##  $ public_health_unit    : chr [1:35] "Ontario" "Algoma" "Brant County" "Chatham-Kent" ...
##  $ case_count            : num [1:35] 1107408 5275 9908 6924 48526 ...
##  $ case_rate             : num [1:35] 7516 4476 6452 6494 8342 ...
##  $ hospitalizations_count: num [1:35] 42132 206 369 278 2252 ...
##  $ hospitalizations_rate : num [1:35] 286 175 240 261 387 ...
##  $ deaths_count          : num [1:35] 12497 33 76 72 523 ...
##  $ deaths_rate           : num [1:35] 84.8 28 49.5 67.5 89.9 62.3 98.2 28.4 70.8 49.8 ...
##  $ population            : num [1:35] 14734014 117840 153558 106620 581722 ...
##  $ eligible_population   : num [1:35] 14010998 112839 144945 101565 552042 ...
# View the updated PHU list
unique(phu_information.df$public_health_unit)
##  [1] "Ontario"                                   
##  [2] "Algoma"                                    
##  [3] "Brant County"                              
##  [4] "Chatham-Kent"                              
##  [5] "City of Hamilton"                          
##  [6] "Durham Region"                             
##  [7] "Eastern Ontario"                           
##  [8] "Grey Bruce"                                
##  [9] "Haldimand-Norfolk"                         
## [10] "Haliburton, Kawartha, Pine Ridge"          
## [11] "Halton Region"                             
## [12] "Hastings Prince Edward"                    
## [13] "Huron Perth"                               
## [14] "Kingston, Frontenac and Lennox & Addington"
## [15] "Lambton"                                   
## [16] "Leeds, Grenville & Lanark"                 
## [17] "Middlesex-London"                          
## [18] "Niagara Region"                            
## [19] "North Bay Parry Sound"                     
## [20] "Northwestern"                              
## [21] "Ottawa"                                    
## [22] "Peel"                                      
## [23] "Peterborough"                              
## [24] "Porcupine"                                 
## [25] "Sudbury & Districts"                       
## [26] "Region of Waterloo"                        
## [27] "Renfrew County"                            
## [28] "Simcoe Muskoka"                            
## [29] "Southwestern"                              
## [30] "Thunder Bay"                               
## [31] "Timiskaming"                               
## [32] "Toronto"                                   
## [33] "Wellington-Dufferin-Guelph"                
## [34] "Windsor-Essex County"                      
## [35] "York Region"

We’re now ready to start visualizing our data, but how will we go about doing it?

1.3.0 Which aspects of the data do we wish to highlight?

Let’s take a look at the following chart from Dr. Andrew V. Abela.

From the flowchart above we’ll mainly explore looking at relationships and composition using our dataset as a basis for our visualizations. We’ll cover distributions and comparison with a more complex dataset later in this lecture.


2.0.0 Visualizations help us identify relationships and correlations

Depending on the nature of your data and its dimensions there are a number of plots to choose from to represent and convey relationships between your variables. These various plots can reveal trends, correlations, and ultimately relationships between variables. For instance, as an initial form of graphical assessment, scatterplots build a framework for exploring our data further.

| Relationship description | Types of charts |
|---|---|
| Finding correlations in your data | Scatterplot, Bubble chart, Lineplot, Heatmap |
| Showing connections between groups | Arc diagram, Chord diagram, Connection map, Network diagram, Non-ribbon chord diagram, Tree diagram |
| Showing relationships and connections between the data, or showing correlations between two or more variables | Heatmap, Marimekko chart, Parallel coordinates plot, Radar chart, Venn diagram |

In our first lecture, we covered the creation of lineplots. Today, we’ll focus on just two relational plots: the scatterplot and its variant, the bubble chart. Parallel coordinate plots will also make an appearance later in this lecture.


2.1.0 Scatterplots and bubblecharts are geom_point() graphs

When we try to examine relationships, one direction we can take is to look for trends by comparing one variable against another. From an experimental standpoint we can graph our independent variable on the x-axis, looking for changes in our dependent variable/measurement on the y-axis. This is most easily accomplished when both of your variables are on a continuous scale.

When trying to show correlations between two or three variables, you can use scatterplots. Remember that there are limitations when visualizing multi-dimensional data on a two-dimensional canvas. Overall, these plots can convey to your audience potential correlations between your variables, or separations between groups encoded by colour or shape.

2.1.1 Use the alpha parameter to help visualize overlapping data points

Let’s begin with a scatterplot comparing the number of cases versus hospitalizations. We’ll use the observations from each PHU as separate datapoints. Of note, within the geom_point() layer, we’ll be updating the alpha parameter to change the transparency of our points.

Using the “alpha” parameter: Note that alpha = 1 will make completely opaque points while alpha = 0 will make completely transparent points. As points begin to overlap, they will create increasingly opaque areas in your visualization.

# scatter plot of case_count vs hospitalizations_count
phu_information.df %>% 
  # Drop the first row of "Ontario" data
  slice(2:n()) %>% 

  # 1. Data
  ggplot() +
      # 2. Aesthetics
      aes(x=case_count, y = hospitalizations_count) +
  
      # set text size
      theme(text = element_text(size = 10)) +
  
      # 4. Geoms
      geom_point(alpha = 0.5)


2.1.2 The bubbleplot adds an obvious dimensionality to your data

The bubbleplot, as we’ll see, brings an additional dimension to your visualization by varying the size of your points based on a (usually) continuous variable. This can help to highlight underlying data trends in a more obvious fashion for your audience.

To achieve this we will set the size parameter in our aesthetics aes() layer. Setting the size parameter to a variable in our data will alter the size of our data points. We will augment this by providing a size range through the scale_size() layer.

# bubble plot of case_count vs hospitalizations_count
phu_information.df %>% 
  # Drop the first row of "Ontario" data
  slice(2:n()) %>% 
  
  # 1. Data
  ggplot() +
      # 2. Aesthetics
      aes(x=case_count, y = hospitalizations_count, 
          size = population) +                       ## Set the size aesthetic
  
      # set text size
      theme(text = element_text(size = 10)) +
  
      # 3. Scaling
      ## set the range for sizing our dots
      scale_size(range = c(1,10), name = "Population (M)") + 
  
      # 4. Geoms
      geom_point(alpha = 0.5)


2.2.0 Scale your axes to help see your data better

From above we can see a lot of values are bunched up together at the lower end of our scales and likewise we have a few values that are far out on our x/y axes. To even this out, we can transform our axes to display values on a log10 scale. We have two places where we can accomplish this:

  1. Transform your data directly, i.e. log(case_count). Data will be placed on a log value scale but may have less meaning to the lay person.
  2. Transform your scales/tick marks. The data is untouched and the values still have meaning to your audience without having to do conversions in their head.
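For contrast, option 1 (transforming the values themselves) might look like the following sketch; note that the axis tick labels would then show log10 values rather than raw counts:

```r
# Option 1 sketch: log-transform the data directly before plotting.
# The axes now display log10 values, which can be harder for a lay audience.
phu_information.df %>% 
  # Drop the first row of "Ontario" data
  slice(2:n()) %>% 
  mutate(log_cases = log10(case_count),
         log_hosp  = log10(hospitalizations_count)) %>% 
  ggplot() +
      aes(x = log_cases, y = log_hosp, size = population) +
      scale_size(range = c(1, 10), name = "Population (M)") +
      geom_point(alpha = 0.5)
```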
# bubble plot of case_count vs hospitalizations_count
phu_information.df %>% 
  # Drop the first row of "Ontario" data
  slice(2:n()) %>% 

  # 1. Data
  ggplot() +
      # 2. Aesthetics
      aes(x=case_count, y = hospitalizations_count, size = population) +
  
      # set text size
      theme(text = element_text(size = 10)) +
  
      # 3. Scaling
      scale_size(range = c(1, 10), name = "Population (M)") + # set the range for sizing our dots
  
      ## Set the x and y-axis to log scales
      scale_x_log10() + 
      scale_y_log10() + 
  
      # 4. Geoms
      geom_point(alpha = 0.5)


From our bubbleplot it is clear that rising case counts are tied to increasing population size. The relationship between our log-transformed data appears to be linear.

How do you interpret log-transformed data?: After log-transforming both axes it looks like the relationship between overall cases and hospitalizations is linear. While the transformation has definitely improved the visualization, it has made the interpretation a little harder. Overall, in this situation, you’ll either want to model against the untransformed data or do the math correctly and recognize that the relationship is a power function:

\[\begin{equation*} \begin{aligned} \log y &= B + a \log x &&\text{where } B \text{ is the intercept of the vertical axis and } a \text{ is the slope}\\[10pt] y &= 10^{(B + a \log x)}\\[10pt] y &= 10^{B}\, 10^{a \log x}\\[10pt] y &= 10^{B} x^{a} &&\text{recall that } 10^{a \log_{10} x} = x^{a}\\[10pt] \end{aligned} \end{equation*}\]
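To estimate B and a in practice, one sketch (assuming the tidied phu_information.df from earlier) is to fit a linear model on the log-transformed values:

```r
# Fit log10(hospitalizations) ~ log10(cases) to estimate intercept B and slope a
fit <- phu_information.df %>% 
  # Drop the first row of "Ontario" data
  slice(2:n()) %>% 
  lm(log10(hospitalizations_count) ~ log10(case_count), data = .)

coef(fit)  # "(Intercept)" is B, "log10(case_count)" is a
# On the original scale the model is: hospitalizations = 10^B * cases^a
```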

3.0.0 Distribution plots display frequency and spread across groups or intervals

Many students are familiar with the classic bar chart. We’ve already used it to some extent to look at last lecture’s PHU case data. When comparing groups or populations for differences in their distribution, however, the bar chart falls quite short of the ideal.

When comparing distributions, we again are often concerned with summary statistics - mean, median, standard deviation, confidence intervals. The nature of our data, however, is often not continuous across an entire y-axis interval as a bar chart may suggest but rather our population is the result of values centred, with some variance, around a mean value.

Over the remainder of the lecture we will compare and contrast 4 types of distribution plots for their relevance and when they may be most useful to visualize our data. When applicable to our data visualizations, we will also display all of our data points to better illustrate our distribution versus the visualization. We will examine:

  1. Bar charts or Barplots
  2. Density plots
  3. Box and Whisker plots
  4. Violin plots

3.1.0 The barplot uses the geom_bar() or geom_col() to display data

As we have already seen, the geom_bar() layer can be used to display our data in multiple ways. The height of the bar is proportional to either:

1. the number of observations of each group or

2. the sum of weights applied by a weight aesthetic

In our Lecture 01 examples we actually used the value of our observations to determine height and proportions of our groups by setting the parameter stat="identity". This was the application of a weighting based on the values of the y-axis variable we chose.

A simpler way to accomplish this effect is with geom_col(), which already uses stat_identity() by default to calculate proportions.

| geom | Description | Requirements |
|---|---|---|
| geom_bar() | Counts the observations in your data and (by default) uses the count as bar height | Accepts only x OR y in aes() |
| geom_col() | Uses the y-axis values as the height of each bar | Requires both x AND y in aes() |

Both of these geoms use position="stack" by default, so multiple values sharing the same x position are stacked and the total bar height matches their count or sum. Such values can instead be displayed side by side using position="dodge" or position="dodge2". This is only helpful, however, when the number of x values (i.e. categories) is low; otherwise the graph becomes crowded.
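As a small sketch of stacked versus dodged bars, using a hypothetical toy tibble:

```r
# Hypothetical toy data: two group measurements per category
toy.df <- tibble(category = rep(c("A", "B"), each = 2),
                 group    = rep(c("x", "y"), times = 2),
                 value    = c(3, 5, 2, 6))

# Default: bars for groups "x" and "y" are stacked within each category
toy.df %>% ggplot() + aes(x = category, y = value, fill = group) +
  geom_col(position = "stack")

# Dodged: the same bars are placed side by side instead
toy.df %>% ggplot() + aes(x = category, y = value, fill = group) +
  geom_col(position = "dodge")
```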

Let’s begin with a simple barplot of SARS-Cov-2-related deaths accumulated over the course of the pandemic. Using phu_information.df as a data source we will convey the distribution of total deaths per PHU using the geom_col() layer.

Recall the first row of our data is “Ontario”, representing the total values for each variable as a sum of the other PHUs. We’ll remove that from our data using slice() before passing it on to ggplot().

# barplot of death counts across PHUs
phu_information.df %>% 
  # Remove the top row (Ontario data)
  slice(2:n()) %>% 
  
  # 1. Data
  ggplot(.) +
      # 2. Aesthetics
      aes(x = public_health_unit, y = deaths_count) +
  
      theme(text = element_text(size = 20)) + # set text size
      theme(axis.text.x = element_text(angle=90, hjust = 1, vjust=0.5)) + # Adjust our x-axis text
  
      # 4. Geoms
      geom_col()    ## Add our bars


3.1.1 Use reorder() to sort your data as you plot it

In the case of phu_information.df we can do a little better by ordering our x-axis of PHUs by total population or death counts. Since our data table is quite simple and only a single observation for each PHU occurs, there are a number of ways to accomplish a sort:

  1. Convert population to a factor and use fct_reorder() to sort our factors.
  2. Sort the entire table using arrange() before plotting.
  3. Use the reorder() function while generating the plot.
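The first two approaches could be sketched as follows (the lecture example uses the third, reorder()):

```r
# Option 1 sketch: reorder factor levels with forcats::fct_reorder()
phu_information.df %>% 
  slice(2:n()) %>% 
  mutate(public_health_unit = fct_reorder(public_health_unit,
                                          population, .desc = TRUE)) %>% 
  ggplot() + aes(x = public_health_unit, y = deaths_count) + geom_col()

# Option 2 sketch: sort the rows with arrange(); note that ggplot2 ignores
# row order for character axes, so we still lock in the order as a factor
phu_information.df %>% 
  slice(2:n()) %>% 
  arrange(desc(population)) %>% 
  mutate(public_health_unit = factor(public_health_unit,
                                     levels = public_health_unit)) %>% 
  ggplot() + aes(x = public_health_unit, y = deaths_count) + geom_col()
```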

Let’s sort our data by PHU population size in our next example using the R base function reorder() within our ggplot() call.

# barplot using the reorder function
phu_information.df %>% 
  # Remove the top row (Ontario data)
  slice(2:n()) %>% 
  
  # 1. Data
  ggplot(.) +
      # 2. Aesthetics
      ### 3.1.1 We must choose a variable to sort by in the reorder function - use population
      aes(x = reorder(public_health_unit, -population), 
          y = deaths_count) +
  
      theme(text = element_text(size = 20)) + # set text size
      theme(axis.text.x = element_text(angle=90, hjust = 1, vjust=0.5)) + # Adjust our x-axis text
  
      # 4. Geoms
      geom_col()


3.2.0 Barplots can help guide the eye

From above, using the barplots we can quickly gauge the difference between PHUs as they are sorted by population. There’s not much need to colour by population as well, but it does add some emphasis to Toronto as a larger population and further reinforces the idea that we have sorted by population size.

What are we missing from this visualization that would help the reader? It would be nice to know just how much variation there is in population size, especially in the first ~12 PHUs. How much bigger is Peel versus the Region of Waterloo?

3.2.1 Use the fill parameter to distinguish between groups

One simple way to enhance our barplot is through the use of fill colour. By setting the fill parameter to population we can shade each bar based on the population size of each PHU. Let’s take a look.

# barplot using the reorder function
phu_information.df %>% 
  # Remove the top row (Ontario data)
  slice(2:n()) %>% 
  # We must choose a variable to sort by in the reorder function - use population
  
  # 1. Data
  ggplot(.) +
      # 2. Aesthetics
      aes(x = reorder(public_health_unit, -population), y = deaths_count, fill = population) +
  
      theme(text = element_text(size = 20)) + # set text size
      theme(axis.text.x = element_text(angle=90, hjust = 1, vjust=0.5)) + # Adjust our x-axis text
  
      ### 3.2.1 Adjust the fill legends
      guides(fill = guide_colorbar(title="Population\nsize")) +
  
      # 4. Geoms
      geom_col()


3.2.2 Add extra geom_* layers to more accurately reflect values

As you can see from above, unless you have a very good eye for shades, the fill colour doesn’t help us distinguish between populations much. We can still see that Toronto has roughly 3M residents, but otherwise we still have trouble discriminating between populations.

Instead of fill colour, we can add a little detail by layering an additional geom. In this case, we will use geom_point() to add the population of each PHU after scaling it down to 1/1000th.

Why scale population by 1/1000th? This is a decision based on what we already know about the data. Our y-axis deaths_count values range from 0 to about 4000, while populations range up to about 3M, so dividing by 1000 brings the two onto comparable scales.

# barplot using the reorder function
phu_information.df %>% 
  # Remove the top row (Ontario data)
  slice(2:n()) %>% 
  # We must choose a variable to sort by in the reorder function - use population
  
  # 1. Data
  ggplot(.) +
      # 2. Aesthetics
      aes(x = reorder(public_health_unit, -population), y = deaths_count) +
  
      theme(text = element_text(size = 20)) + # set text size
      theme(axis.text.x = element_text(angle=90, hjust = 1, vjust=0.5)) + # Adjust our x-axis text
  
      guides(size = guide_legend(title="Population\nsize per\n1000"), 
             colour = "none") +
  
      # 4. Geoms
      
      geom_col() +
  
      ### 3.2.2 Add a geom_point to represent population size
      geom_point(aes(y = population/1000, size = population/1000), colour = "red")


3.2.3 The lollipop plot: a sweet twist on the barplot.

Now we have more clarity on population sizes between PHUs, answering our original question: Peel Region has nearly one million more inhabitants than the Region of Waterloo. To publish this figure we would want to fix a few additional things, like the legend presentation and the axis names and titles. We’ll drill into these ideas more next lecture!

On this same topic, let’s visit one last plot that combines pieces of the two barplot variants we’ve used. The lollipop plot clarifies that x-axis values are singular in nature rather than spanning a range, while still visually connecting each y-axis value to its x-axis category.

It looks very much like it sounds, and we’ll add an extra twist to ours by setting the point size to the population size, giving it one more informational dimension. To accomplish this visualization we’ll combine a geom_point() with a geom_segment().

Aesthetics when working with multiple geoms: As you’ll see in the following code, we’ve made some adjustments due to working with multiple geom_* layers. Since there can be overlapping aesthetics between geoms, you need to be cognizant of their effects. Rather than set these parameters in the aes() layer, set them directly in their respective geoms.

# scatter-style using the reorder function
phu_information.df %>% 
  # Remove the top row (Ontario data)
  slice(2:n()) %>% 
  # Arrange our data here to clean up the code
  arrange(desc(population)) %>% 
  # Use the updated order to reorder the public_health_unit variable
  mutate(public_health_unit = factor(public_health_unit, levels = .$public_health_unit)) %>% 
  
  ggplot(.) +
      # 2. Aesthetics
      aes(x = public_health_unit, y = deaths_count) +
      
      theme(text = element_text(size = 20)) + # set text size
      theme(axis.text.x = element_text(angle=90, hjust = 1, vjust=0.5)) + # Adjust our x-axis text
  
      guides(size = guide_legend(title="Population\nsize"), 
             colour = "none") + # Set color legend to none to remove it 
  
      # 3. Scaling
      # set the range for sizing our dots
      scale_size(range = c(1, 10)) + 
  
      # 4. Geoms
      ### 3.2.3 Make the stick of the lollipop
      geom_segment(aes(x=public_health_unit, xend=public_health_unit, 
                       y=0, yend=deaths_count)) +
  
      ### 3.2.3 Put the candy on top
      geom_point(aes(size = population), colour = "orchid")


3.3.0 Barplots to convey proportion or composition

Let’s revisit our public health unit data from last lecture. We’ll use an updated long-format version that we created and saved in tsv format. As you might recall, this dataset contains a column of dates, with each row representing an observation of new_cases reported by a public_health_unit on that date. A fourth column, total_phu_new, gives the total cases reported across all PHUs on that date.

| date | public_health_unit | new_cases | total_phu_new |
|------|--------------------|-----------|---------------|
| Date format: YYYY-MM-DD | factor of 34 PHUs | numeric:double | numeric:double |

In our first lecture we looked at the other helpful visualization that barplots can produce: composition and proportions. Here we can combine the proportions of categories of data to help convey an added dimension to our data. By stacking PHUs in our “new case” data we are now plotting new cases per month for multiple PHUs. The area of each stack, however, also gives us a sense of proportions for each PHU that we’ve added.

# Open our data file and define the column types rather than let the function decide.
covid_phu.df <- read_tsv("./data/Ontario_daily_change_in_cases_by_phu_long.tsv", 
                         col_types = "Dfdd") # Define column types here

# Take a peek at the data
head(covid_phu.df)
tail(covid_phu.df)

3.3.1 Working with Dates and the lubridate package

Last week we breezed over the idea of working with the date format in R. It can be a rather esoteric subject, but it all comes down to how we as a society track dates. In R, a date is stored as a numeric value representing the number of days since 1970-01-01. So yes, there must exist some “negative” dates.
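We can check this encoding directly by stripping the date class off a few values:

```r
# A date prints as text but is stored as days since 1970-01-01
as.numeric(as.Date("1970-01-02"))  # 1
as.numeric(as.Date("2020-01-23"))  # 18284

# Dates before the epoch are stored as negative numbers
as.numeric(as.Date("1969-12-25"))  # -7
```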

The lubridate package is here to help you simplify working with dates and has a number of helpful functions that can be used to extract information from your date. For us, we’ll use the floor_date() function to help round our dates down to their Year-Month format. This function takes just two parameters:

  • date: your date or vector of dates to convert

  • unit: the unit by which you want to round down - this could be year, month, …, second (if your date is tracking that specifically)

Note that our dates will still be in a date object format!

# What do our dates look like?
head(covid_phu.df$date)
## [1] "2020-01-23" "2020-01-23" "2020-01-23" "2020-01-23" "2020-01-23"
## [6] "2020-01-23"
# What do they look like after rounding down by month?
floor_date(covid_phu.df$date, unit = "month") %>% head()
## [1] "2020-01-01" "2020-01-01" "2020-01-01" "2020-01-01" "2020-01-01"
## [6] "2020-01-01"

3.3.2 Use the scale_x_date() layer to set your x-axis information

Let’s use this to convert our date values for plotting as bar graphs! We’ll convert our x-axis values in our aes() call. Our data only goes out to January 25, 2023 so we’ll set our limits to accommodate that point.

Note also our scale_x_date() layer. We used this last week to control how our x-axis displays the data, including parameters like:

  • limits: where our x-axis begins and ends

  • date_breaks: how we would like our x-axis ticks to break up

  • date_labels: how we want our x-axis ticks to be labeled

  • expand: adds padding beyond the limits of our x-axis. It accepts two values c(mult, add): mult expands outwards by multiplying the limits, while add pads the limits by a fixed amount in axis units (days, in this case).
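As a hedged sketch of the additive form we’ll use below (expand = c(0, 5)), the newer expansion() helper in ggplot2 spells out the same two values by name:

```r
library(ggplot2)

# Minimal sketch: a two-point date series, just to show the expand parameter.
dates.df <- data.frame(d = as.Date(c("2022-12-01", "2023-01-01")),
                       v = c(10, 20))

ggplot(dates.df, aes(x = d, y = v)) +
  geom_col() +
  # expansion() is the readable equivalent of expand = c(0, 5):
  # no multiplicative padding, an additive pad of 5 days per side.
  scale_x_date(expand = expansion(mult = 0, add = 5))
```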

Through some experimentation, we’ll set our limits to about half a month before our start date and after our end date. We’ll aim to display our data from December 2021 to January 2023, but remember that these dates have been floored to the first of each month!

# Start with our PHU covid data
covid_phu.df %>% 
  # Filter to just look at a few PHUs
  filter(public_health_unit %in% c("Toronto", "Peel", "York Region", "Ottawa")) %>% 
  
  # 1. Data
  ggplot(.) +
      # 2. Aesthetics
      aes(x = floor_date(date, unit = "month"), ## Floor the date
          y= new_cases, 
          # set our fill colour instead of line colour
          fill = fct_reorder(public_health_unit, new_cases, .desc=TRUE)) + 
  
      theme(text = element_text(size = 20)) + # set text size
  
      guides(fill = guide_legend(title="Public Health Unit")) +
      xlab("Date") + # Set the x-axis label
      ylab("New cases") + # Set the y-axis label
      ggtitle("New cases per day across all Ontario Public Health Units") +
  
      # 3. Scaling
      ### 3.3.2 Scale our dates from mid-November to Mid-January.
      scale_x_date(limits = c(as.Date("2021-11-15"), as.Date("2023-01-15")), 
                   date_breaks = "1 month", date_labels = "%b-%Y", 
                   expand = c(0,5)) + # Pad the x-axis by 5 days
      scale_fill_viridis_d() + # the "d" stands for discrete colour scale
  
      # 4. Geoms
      geom_col() # Set up our barplot here
## Warning: Removed 2034 rows containing missing values
## (`position_stack()`).


3.4.0 Convert your plots to a circular layout with coord_polar()

Generally speaking there are a number of circular or polar coordinate plot variants ranging in complexity from the pie chart to a racetrack chart. The coord_polar() layer takes a few parameters:

  • theta: determines if angles will be mapped to the x or y variable
  • start: offset from the 12 o’clock position
  • direction: 1 = clockwise, -1 = counter-clockwise

Here’s a summary:

| Chart name | Plot description | Uses | Cartesian analog |
|------------|------------------|------|------------------|
| Pie chart | Sliced-up proportions of a circle | Simple proportion comparison between groups | Single-bar variant |
| Nightingale/Coxcomb plot | Intervals are equal wedges; shaded areas represent proportions | Intervals with additional categorical proportions | Stacked column plot, theta = "x" |
| Race track plot | Intervals are split concentrically to form rings of equal width | Helpful if some groups are less diverse | Stacked column plot, theta = "y" |

3.4.1 Pie charts work best with simple data

From our pie chart description and from your own experience, you can recall that pie charts aren’t helpful with complex data. Once the number of categories surpasses 6-8 groups, things can get hard to visually digest - especially if one category dominates the others.

Let’s use phu_information.df to assess the cumulative COVID-19 case data from across our PHUs to look at the top 4 PHUs by caseload. We’ll convert a geom_bar() barplot to our pie chart using the coord_polar() layer.

To accomplish this feat it helps to think about how the data is being handled. If you think about a pie chart, you can imagine it much like a stacked barplot where the top and bottom have been curved around to meet each other. This is a plot all about proportions and instead of having an x-axis defined, it is the proportions for a single category or group.

Therefore, we set the aes() parameter x to a value of "" so there are no categorical values for x, but you will still need to set a y value from which to determine those proportions.

# Go to the simple phu information table
phu_information.df %>% 
  # Sort for only a few PHUs
  filter(public_health_unit %in% c("Toronto", "Peel", "York Region", "Ottawa")) %>% 
  
  # 1. Data
  ggplot(.) +
    # 2. Aesthetics
    ### 3.4.1 Set the x axis correctly
    aes(x="", y=case_count, 
        fill = public_health_unit) + # set our fill colour instead of line colour

    theme_void() + # This will remove all background theme objects
    theme(text = element_text(size = 20)) + # set text size

    guides(fill = guide_legend(title="Public Health Unit")) +
    xlab("") + # Set the x-axis label
    ylab("Total cases") + # Set the y-axis label
    ggtitle("Total cases by health unit") +

    # 3. Scaling
    scale_fill_viridis_d() + # the "d" stands for discrete colour scale

    # 4. Geoms
    geom_bar(width=1, stat="identity", colour = "white") + # Set up our barplot here

    # 7 Coordinate system
    ### 3.4.1 Map angles to the Y-axis
    coord_polar(theta="y")


3.4.2 Nightingale rose charts emphasize proportions

Made famous by Florence Nightingale, these plots (which she named coxcomb plots) were used to emphasize the proportions of soldier deaths due to infection during the Crimean War. Each circular increment is equally spaced, representing increasing values along the y-axis. Visually, this gives greater area representation as we radiate outwards on the chart. This produces a chart that:

  1. Makes distinguishing larger proportions more obvious.
  2. Adds a new dimension to the pie chart, allowing us to create intervals for comparison.
  3. Helps connect our start and end category, especially if they don’t represent an ordinal or continuous scale.

Florence Nightingale’s original rose chart/coxcomb plot publication. Sourced from the Wikimedia commons

Caution: outer segments disproportionately represent increases in value. Their larger area produces more visual emphasis despite the linear scale of the y-axis.
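This caution follows directly from the geometry: a circular sector of angle theta out to radius r has area (theta/2)r², so the segment between radii r₁ < r₂ has area

\[A_{\text{segment}} = \frac{\theta}{2}\left(r_2^2 - r_1^2\right)\]

For a fixed angle, the segment from y = 0 to y = 1 has area theta/2, while the equal increment from y = 1 to y = 2 has area 3·theta/2: the same change in value occupies three times the area.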

We’ll switch back to our PHU daily case data from covid_phu.df to generate some stackable information.

Know your y-limits! Note that due to the disproportionate area increases with increasing y-axis values, you need to consider the range of your data. We’ve seen the high values in our January 2022 data, so we’ll set the x-axis scale again with scale_x_date() to range from February 2022 to January 2023.

# Start with our PHU covid data
covid_phu.df %>% 
  # Filter to just look at a few PHUs
  filter(public_health_unit %in% c("Toronto", "Peel", "York Region", "Ottawa")) %>% 
  
  # 1. Data
  ggplot(.) +
    # 2. Aesthetics
    aes(x = floor_date(date, unit = "month"), y= new_cases, 
        # set our fill colour instead of line colour
        fill = fct_reorder(public_health_unit, new_cases, .desc=TRUE)) + 

    theme_bw() + # make the theme more plain
    theme(text = element_text(size = 20)) + # set text size
    theme(panel.grid.major.y = element_line(color="grey60")) + # darken our major y grid

    guides(fill = guide_legend(title="Public Health Unit")) +
    xlab("Date") + # Set the x-axis label
    ylab("New cases") + # Set the y-axis label
    ggtitle("New cases per day across all Ontario Public Health Units") +

    # 3. Scaling
    scale_x_date(limits = c(as.Date("2022-01-15"), as.Date("2023-01-15")),
                            # Scale the dates correctly to get a month-Year format
                            date_breaks = "1 month", date_labels = "%b-%Y", expand = c(0,0)) + 
    scale_fill_viridis_d() + # the "d" stands for discrete colour scale
    
    # 4. Geoms
    ### 3.4.2 Set up our barplot here
    ... + 

    # 7 Coordinate system
    ### 3.4.2 Convert the barplot to a polar coordinate
    ...
## Error in eval(expr, envir, enclos): '...' used in an incorrect context

3.4.3 Racetrack plots are an aesthetic variant of the bar chart

Want to make another eye-catching chart from a plain bar chart? Stacked or otherwise, you can convert a barplot to a racetrack (radial bar) chart. The only difference from the coxcomb plot is that instead of the default theta="x" we set it to "y", much like flipping the coordinates of a bar chart.

Beware the racetrack transformation: this is an aesthetic transformation in which each bar gets relatively longer as you move from the inside out. Values must therefore be judged by the angle swept, not by arc length! Our eyes judge differences more precisely on a Cartesian bar chart, so don’t be misled by how these charts look at a casual glance.

Let’s make a bar chart of the data and compare it to the racetrack plot. Note that negative values in our dataset are plotted separately; this is an implementation quirk of ggplot2.

# Start with our PHU covid data
covid_phu.df %>% 
  # Filter to just look at a few PHUs
  filter(public_health_unit %in% c("Toronto", "Peel", "York Region", "Ottawa")) %>% 
  
  # 1. Data
  ggplot(.) +
    # 2. Aesthetics
    aes(x = floor_date(date, unit = "month"), y= new_cases, 
        # set our fill colour instead of line colour
        fill = fct_reorder(public_health_unit, new_cases, .desc=TRUE)) + 

    theme(text = element_text(size = 20)) + # set text size

    guides(fill = guide_legend(title="Public Health Unit")) +
    xlab("Date") + # Set the x-axis label
    ylab("New cases") + # Set the y-axis label
    ggtitle("New cases per day across all Ontario Public Health Units") +

    # 3. Scaling
    scale_x_date(limits = c(as.Date("2022-01-15"), as.Date("2023-01-15")),
                 date_breaks = "1 month", date_labels = "%b-%Y", expand = c(0,5)) +
    scale_fill_viridis_d() + # the "d" stands for discrete colour scale
    
    # 3. Scaling
    ... + ### 3.4.3 Flip the x and y axes

    # 4. Geoms
    geom_col() # Set up our barplot here
## Error in eval(expr, envir, enclos): '...' used in an incorrect context
# Start with our PHU covid data
covid_phu.df %>% 
  # Filter to just look at a few PHUs
  filter(public_health_unit %in% c("Toronto", "Peel", "York Region", "Ottawa")) %>% 
  
  # 1. Data
  ggplot(.) +
    # 2. Aesthetics
    aes(x = floor_date(date, unit = "month"), y= new_cases, 
        # set our fill colour instead of line colour
        fill = fct_reorder(public_health_unit, new_cases, .desc=TRUE)) + 

    theme_bw() + # make the theme more plain
    theme(text = element_text(size = 20)) + # set text size
    theme(panel.grid.major.y = element_line(color="grey60")) + # darken our major y grid

    guides(fill = guide_legend(title="Public Health Unit")) +
    xlab("Date") + # Set the x-axis label
    ylab("New cases") + # Set the y-axis label
    ggtitle("New cases per day across all Ontario Public Health Units") +

    # 3. Scaling
    scale_x_date(limits = c(as.Date("2022-01-15"), as.Date("2023-01-15")),
                 # Scale the dates correctly to get a month-Year format
                 date_breaks = "1 month", date_labels = "%b-%Y", expand = c(0,5)) + 
    scale_fill_viridis_d() + # the "d" stands for discrete colour scale
    
    # 4. Geoms
    geom_col() + # Set up our barplot here

    # 7 Coordinate system
    ### 3.4.3 No need to flip the coordinates, just set theta=y
    coord_polar(...)
## Error in eval(expr, envir, enclos): '...' used in an incorrect context

3.4.4 Barplots and their variants summarize the distribution of data between groups or intervals but NOT within single populations.

Often for our purposes as scientists we are more concerned with the distribution of replicates or measurements as a population within a group, while also comparing across other groups (i.e., control vs. treatment). Naively, we are tempted by the convenience of Excel to simplify our data as a bar chart with standard deviation or error bars. One word concisely describes this behaviour:

\[\text{Inappropriate}\]

Note from our examples that although we are comparing public health units, our datapoints represent single or total observations separated either by group or by time. At each x-axis point we are not comparing the distribution between categories in a meaningful way. In fact, from our first example, we can convey a near-exact message using a dot-plot! In essence these visualizations are communicating the categorization of samples across multiple groups. More or less, this is about visualizing proportions.

Section 3.0.0 Comprehension Question: After exploring the racetrack variant above, visualizing our data ranging from February 2022 to January 2023, what is the biggest issue you might see? What would be one way to remedy this? Is this an appropriate visualization of our data?

# Code your plot here!

4.0.0 Density plots compute a theoretical distribution

What is the shape of our data? When we encounter experimental populations and begin sampling or measuring specific characteristics, we inevitably begin to reveal whether that characteristic has a predictable range, median value, etc. Density plots, given enough sample values, begin to predict the shape or distribution of that sampling. We sometimes use this to represent the theoretical distribution of the original population.

4.0.1 Distribution plots need appropriate data

In contrast to our previous datasets, we are interested in dissecting group characteristics after sampling multiple times. Therefore, before continuing on our journey, we need to find some data that is better analysed as population data. Let’s open up some more SARS-CoV-2 data from Ontario PHUs that has been broken into age and sex demographics: Ontario_age_sex_COVID-19_cases.csv. We’ll see what we can glean from the data.

Before we can do any visualization we’ll want to tidy up the data in the following steps:

  1. Import the data with read_csv()
  2. Fix the column names to remove spaces and replace them with “_”
  3. Pare down the columns we want to look at. In this case we’ll focus on only three indicators: total cases, hospitalizations, and cases broken down into male or female categories.
  4. Summarize the data into a proportion of cases for each age group, based on the specific total for each indicator within each PHU.

After all of our transformations, we’ll have specific proportions of infections, deaths, and hospitalizations within each PHU for each age group.

# Open our dataset
covid_demographics.df <- read_csv(...)
## Error in eval(expr, envir, enclos): '...' used in an incorrect context
# Take a quick look at it
str(covid_demographics.df, give.attr = FALSE)
## Error in str(covid_demographics.df, give.attr = FALSE): object 'covid_demographics.df' not found
tail(covid_demographics.df)
## Error in tail(covid_demographics.df): object 'covid_demographics.df' not found
# Format our dataset to focus on total_cases and total_deaths
covid_demographics_total.df <- 
covid_demographics.df %>% 

#---------- Fix column names ----------#
# convert column names to lowercase and replace spaces with "_". Does this look familiar?
rename_with(., ~ str_to_lower(str_replace_all(string = .x, 
                                              pattern = r"(\s)", 
                                              replacement = "_"))) %>% 

# clean up PHU names
mutate(geographic_area = str_remove_all(geographic_area, 
                                        pattern=r"((^Public\sHealth\s)|(\sPublic.*)|(\sHealth.*))"),
       period = str_remove_all(string = period, pattern = r"(\sCOVID.*)")
                                        ) %>% 

#---------- Filter and fix variables ----------#
# Filter out some of the excess data
filter(geographic_area != "Ontario", # Drop Ontario data
       age_group != "All ages") %>%  # Drop combined age data

# Pare down the dataset to just total_cases, total_deaths, and total_hospitalizations
select(period, from_date, to_date, geographic_area, age_group, 
       total_cases, total_rate, 
       total_hospitalizations_count, total_hospitalizations_rate, 
       male_cases, female_cases) %>% 

# Convert the age_group into a factor
mutate(age_group = factor(age_group),
       # after converting to factor, the "5 to 11" will be in the wrong position. Relevel age_group
       age_group = fct_relevel(age_group, "5 to 11", after=1)) %>% 

# Rename the geographic_area variable
rename(public_health_unit = geographic_area) %>% 

#---------- Summarize data ----------#
# Group the data so you can manipulate based on PHU in the next steps
group_by(period, public_health_unit) %>% 

# generate percent values for each age group within a PHU
mutate(percent_cases = ..., 
       # generate percent hospitalizations for each age group within a PHU
       percent_hospitalizations = total_hospitalizations_count/sum(total_hospitalizations_count),
       # generate percent male and female cases for each age group within a PHU
       # We'll use this data later!
       percent_male_cases = male_cases/(total_cases),
       percent_female_cases = female_cases/(total_cases)
      ) 
## Error in covid_demographics.df %>% rename_with(., ~str_to_lower(str_replace_all(string = .x, : '...' used in an incorrect context
# Take a look at the different age demographics
levels(covid_demographics_total.df$age_group)
## Error in levels(covid_demographics_total.df$age_group): object 'covid_demographics_total.df' not found
str(covid_demographics_total.df, give.attr = FALSE)
## Error in str(covid_demographics_total.df, give.attr = FALSE): object 'covid_demographics_total.df' not found

4.1.0 Density and histogram plots model distribution of samples or individuals within a population

While bar plots help to focus on proportional representation for categorical data, both density plots and histograms can be used to convey the frequency or distribution of values for a variable across its specified range. When comparing distributions visualized this way, you can usually compare up to 3 or 4 on the same plot before the data becomes too crowded. These methods also give you a sense of your data before moving forward with more detailed analyses. You can also identify possible mistakes or artefacts in data collection.

How much data do I need? Density plots can be thought of as smoothed versions of histograms which have been binned in small intervals. Density plots of a single dimension require a minimum of 4 samples but justifying a KDE on a sample size that small is hard. Histograms are recommended to have at least 30 observations to be considered useful and I would apply this rule of thumb to KDEs as well.
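The “smoothed histogram” idea can be sketched on simulated data; this is a hedged example (the data here is synthetic, not from our PHU tables) overlaying a KDE on a density-scaled histogram for n = 30 draws:

```r
library(ggplot2)

# Simulate a modest sample, around the rule-of-thumb minimum of ~30
set.seed(42)
sim.df <- data.frame(x = rnorm(30, mean = 0, sd = 1))

ggplot(sim.df, aes(x = x)) +
  # y = after_stat(density) puts the bars on the same scale as the KDE
  geom_histogram(aes(y = after_stat(density)), bins = 10,
                 fill = "grey80", colour = "grey40") +
  # The KDE is the smoothed counterpart of the binned histogram
  geom_density(colour = "steelblue", linewidth = 1)
```

Re-running with a smaller n makes the point: the KDE still draws a confident-looking curve even when the underlying histogram is clearly too sparse to trust.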

Let’s pick a few age groups to plot based on percent_cases. We’ll use data gathered from each PHU to see if the same general trends (if any) apply regardless of PHU.

covid_demographics_total.df %>% 
  # Filter for just a few age groups
  filter(period == "cumulative", 
         age_group %in% c("0 to 4","12 to 19", "20 to 39", "80+")) %>% 
  
  # 1. Data
  ggplot(.) +
    # 2. Aesthetics
    aes(x = ..., fill = ...) +

    theme(text = element_text(size = 20)) + # set text size

    # 4. Geoms
    geom_density(...) + ## generate our KDE
    ...(aes(colour = age_group)) ## confirm our data values with a geom_rug
## Error in filter(., period == "cumulative", age_group %in% c("0 to 4", : object 'covid_demographics_total.df' not found

4.1.1 Apply a facet_*() to view KDEs separately

As you can see from above, as we start to have more and more age groups, it may be better to separate them out in order to avoid too much overlap. It may also be to our advantage to compare them in a more vertical fashion. Recall that there are two layers we can choose from: facet_wrap() and facet_grid(). They are differentiated by the following characteristics:

  • facet_wrap(): Data is split based on available data combinations of the facets parameter which can be defined by a formula like ~var1+var2 or by using the quoting function vars(var1, var2). In either case, the data is then grouped by your variables. Panels are placed in a single ribbon that wraps around based on the arguments in the ncol and nrow parameters.
  • facet_grid(): Data is split based on the 1 or 2-dimensional combination of facet variables. Even combinations for which there is no data will be displayed as empty panels. Use the rows and cols parameters to set the facets by using the vars() quoting function. Note that you could further group your rows and/or columns by multiple variables.
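The two faceting styles can be sketched side by side on ggplot2’s built-in mpg dataset (a stand-in here, since our own data gets faceted in the exercises below):

```r
library(ggplot2)

# A common base plot to facet two different ways
p <- ggplot(mpg, aes(x = displ, y = hwy)) + geom_point()

# facet_wrap(): one ribbon of panels wrapped into 2 columns;
# only the drive-type groups present in the data appear.
p + facet_wrap(vars(drv), ncol = 2)

# facet_grid(): a full rows-by-columns grid; combinations with
# no observations (e.g. some drv/cyl pairs) show as empty panels.
p + facet_grid(rows = vars(drv), cols = vars(cyl))
```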

Note that both functions also take a scales parameter, which determines whether panels share axis scales or set them from each panel’s own data: free along the y-axis ("free_y"), the x-axis ("free_x"), or both ("free").

Let’s first use facet_wrap() to generate a single-column facet of KDEs based on the cumulative period from our data.

covid_demographics_total.df %>% 
  # filter for only cumulative data
  filter(period == "cumulative") %>% 
  # Select for just the important columns
  select(public_health_unit, age_group, percent_cases) %>% 
  
  # 1. Data
  ggplot(.) +
    # 2. Aesthetics
    aes(x = percent_cases, fill = age_group) +
  
    theme(text = element_text(size = 20)) + # set text size
  
    # 4. Geoms
    geom_density(alpha = 0.5) + 
  
    # 6. Facets
    ### 4.1.1 Add a simple facet wrap, with each age group existing in its own row
    facet_wrap(..., 
               scales=..., 
               ncol=...)
## Error in filter(., period == "cumulative"): object 'covid_demographics_total.df' not found

4.1.2 Use facet_grid() to view both periods of our data

If, for example we wanted to directly compare the data from both periods (recent and cumulative), we could alter the above plot so that our two periods are paneled beside each other. This can be accomplished with the facet_grid() layer which will allow us to name both a rows and cols grouping parameter.

Let’s update our figure so that we split our rows by age_group and our columns by period.

covid_demographics_total.df %>% 
  # Select for just the important columns
  select(period, public_health_unit, age_group, percent_cases) %>% 
  
  # 1. Data
  ggplot(.) +
    # 2. Aesthetics
    aes(x = percent_cases, fill = age_group) +
  
    theme(text = element_text(size = 20)) + # set text size
  
    # 4. Geoms
    geom_density(alpha = 0.5) + 
  
    # 6. Facets
    ## Use a facet_grid to panel data by multiple variables
    facet_grid(rows = ..., 
               cols = ...,
               scales="free") # Allow scales to vary between rows and between columns.
## Error in select(., period, public_health_unit, age_group, percent_cases): object 'covid_demographics_total.df' not found

While the above visualization is pretty clear, it certainly takes up a lot of extra space. Perhaps there is a better way to generate this kind of visualization?

4.2.0 Plot multiple distributions with a ridgeline plot

Ridgeline plots (sometimes called joyplots) provide a compact way to show multiple distributions of numeric values across several groups. The distributions can be shown as either histograms or density plots, all aligned to the same horizontal axis. The vertical axis is compressed slightly to create an overlap between categories.

To generate these visualizations we can use the ggridges package, an extension of ggplot2. That means it uses the same grammar, adding ridgelines as a layer via geom_density_ridges(). A parameter to keep in mind:

  • scale: sets the vertical distance between ridgelines

    • 1: the tallest density curve just touches the baseline of the next vertical category

    • Above 1: increasing overlap

    • Below 1: increasing separation
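The effect of scale can be sketched on simulated data (synthetic groups here, not our demographics table, so we can see the overlap clearly):

```r
library(ggplot2)
library(ggridges)

# Three groups with shifted means, 100 draws each
set.seed(1)
ridge.df <- data.frame(
  group = rep(c("A", "B", "C"), each = 100),
  value = c(rnorm(100, 0), rnorm(100, 1), rnorm(100, 2))
)

ggplot(ridge.df, aes(x = value, y = group, fill = group)) +
  # scale = 1: the tallest curve just touches the next baseline;
  # try scale = 2 for overlap, or scale = 0.8 for separation
  geom_density_ridges(scale = 1, alpha = 0.7)
```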

Let’s begin by replicating our faceted plot from above which compares the cumulative vs recent data across age groups.

covid_demographics_total.df %>% 
  # Select for just the important columns
  select(period, public_health_unit, age_group, percent_cases) %>% 
  
  # Plot a ridgeline plot
  # 1. Data
  ggplot(.) +
    # 2. Aesthetics
    aes(x = percent_cases, y = age_group, fill = age_group) +
  
    theme_bw() + # Simplify the underlying theme
    theme(text = element_text(size = 20)) + # set text size
    theme(legend.position = "none") + # drop the legend since it's redundant
  
    # 4. Geoms
    ### 4.2.0 Add a ridgeline plot instead of KDE
    ... + 
  
    # 6. Facets
    ### 4.2.0 Make panels to display our period data
    facet_grid(..., scales = "free_x")
## Error in select(., period, public_health_unit, age_group, percent_cases): object 'covid_demographics_total.df' not found

4.3.0 Use geom_density_ridges_gradient() to fill densities on a gradient

From above you can now see that we’ve somewhat compacted all that dimensional data into a single plot that still clearly conveys the difference in overall proportions for total infections within each age group. The distributions across our categories suggest that the 20-39 age group makes up a larger proportion of overall cumulative cases within each PHU. On the other hand, the 20-80 age ranges all appear to have similar distributions of cases in the most recent data, although that may be a less reliable indicator of the true case spread.

For our audience, we would need to clean up our axis titles to clarify that these proportions are calculated independently. An additional option for your ridgeline plots is the gradient-fill variant. To accomplish a nicer gradient we will include a call to scale_fill_viridis_c() since our x-axis is continuous. Keep in mind that we cannot set the alpha transparency on our density plots when filling with a gradient. We also have to map the fill aesthetic to stat(x) (after_stat(x) in ggplot2 >= 3.4.0) to accomplish this feat.

covid_demographics_total.df %>% 
  # Select for just the important columns
  select(period, public_health_unit, age_group, percent_cases) %>% 
  
  # Plot a ridgeline plot
  # 1. Data
  ggplot(.) +
    # 2. Aesthetics
    aes(x = percent_cases, y = age_group, fill = stat(x)) +
  
    theme_bw() + # Simplify the underlying theme
    theme(text = element_text(size = 20)) + # set text size
    theme(legend.position = "none") + # drop the legend since it's redundant
  
    # 3. Scaling
    ### 4.3.0 Use the plasma option ("C") 
    scale_fill_viridis_c(name="Percent of PHU", option = "C") + 
  
    # 4. Geoms
    ## Use a gradient version of ridges instead
    ... + 
  
    # 6. Facets
    facet_grid(~period,           # Equivalent to cols = vars(period)
               scales = "free_x")
## Error in select(., period, public_health_unit, age_group, percent_cases): object 'covid_demographics_total.df' not found

5.0.0 Categorical distribution plots

Did you know the boxplot is nearly 50 years old? It was introduced in the 1970s by our favourite statistician, John Wilder Tukey, and we’ll dig into how and when to use this iconic plot. While we’re here we’ll also take a look at other categorical distribution plots. Our KDE and ridgeline plots provide quite a bit of detail, but they are not especially space-efficient; the following categorical distribution plots pack more information into less space.

5.1.0 Summarize population distributions with geom_boxplot()

As you can see from the previous section, we comfortably fit quite a few distributions on a ridgeline plot. From the looks of it, the 20-39 age group makes up a higher percent of cases across all the PHUs. Previously this data was broken down by 10-year groupings, but it has since been amended to use larger age groups in the 20+ range. Still, we can explore this data a little closer.

Can we visualize the data in a more summarized form? Let’s explore the box plot.

The dissection of a boxplot’s components shows us how it summarizes data distribution.

Also known as the box and whisker plot, this visualization conveys the distribution of samples within a group or population and is built upon 5 principal values:

  • “Box”

    • median: dark line across centre of box

    • lower quartile: lower-bound of box

    • upper quartile: upper-bound of box

  • “Whiskers”

    • lower extreme: length of lower whisker
    • upper extreme: length of upper whisker

Together, the lower and upper quartiles bound the interquartile range (IQR). The general implementation of boxplots classifies any observation more than 1.5 × IQR above the upper quartile or below the lower quartile as an outlier of the distribution. The appearance of outliers can be set via parameters within the geom_boxplot() layer, including outlier.shape, outlier.size, and outlier.colour.
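As a quick sanity check, the 1.5 × IQR rule is easy to reproduce in base R. This is a minimal sketch using a made-up numeric vector, not our COVID data:

```r
# Made-up sample with one obvious outlier
x <- c(2, 3, 4, 4, 5, 5, 6, 7, 20)

# Quartiles and the interquartile range
q <- unname(quantile(x, c(0.25, 0.75)))
iqr <- q[2] - q[1]

# The 1.5 * IQR "fences" beyond which points are flagged as outliers
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[2] + 1.5 * iqr
outliers <- x[x < lower_fence | x > upper_fence]

# base R's boxplot.stats() applies the same rule (via Tukey's hinges)
boxplot.stats(x)$out
```

Here both approaches flag the value 20; these are the points that geom_boxplot() draws individually beyond the whiskers.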

Unlike a histogram, the minimum number of values needed to generate a meaningful boxplot is 5. While you could generate a boxplot from fewer observations, you might not have actual whiskers! This is definitely a great alternative when sample sizes are between 5 and 30 for each population.

Historically this was a simple way to visualize the summary statistics of a population while being easy to produce by hand. Of course, with the age of computing, the production of kernel density estimates has allowed for more diverse visualizations. This plot, however, remains a popular format and is thus more readily understood by general audiences.

Each compact box can take up the same space as a barplot column but it gives much more information about the population. Let’s look at a single aspect - percent_cases in the cumulative period.

# Generate a basic box plot with outliers present
covid_demographics_total.df %>% 
  # Filter for cumulative data
  filter(period == "cumulative") %>% 
  
  # 1. Data
  ggplot(.) +
    # 2. Aesthetics
    aes(x=..., y = ...) +

    theme(text = element_text(size = 20)) + # set text size

    # 4. Geoms
    ... ### 5.1.0 Add the boxplot geom
## Error in filter(., period == "cumulative"): object 'covid_demographics_total.df' not found

5.2.0 Upgrade your boxplot with some confidence intervals and values

From the looks of it, we can confirm what we saw before in our density plot - that the 20-39 age group generates the highest percentage of cases across PHUs and that with increasing age, the number of reported cases decreases as a proportion of the total.

We can add a few extra items to the plot to help us visualize the data:

  • add our data points with geom_jitter() and remove outliers from the boxplot to avoid double-plotting points.

  • mark a ~95% confidence interval within the shape of each box plot with the notch parameter.

  • you can set a variable width for each boxplot that is based on the number of samples used to generate the plot.
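Under the hood, notch = TRUE draws the interval median ± 1.58 × IQR / √n, the same interval base R's boxplot.stats() returns as $conf (see ?boxplot.stats and ?geom_boxplot). A minimal sketch with a made-up vector:

```r
x <- c(2, 3, 4, 4, 5, 5, 6, 7, 20)
n <- length(x)

s <- boxplot.stats(x)        # $stats: whisker, hinge, median, hinge, whisker
med <- s$stats[3]
iqr_hinge <- s$stats[4] - s$stats[2]

# The ~95% confidence interval the notch represents
notch <- med + c(-1.58, 1.58) * iqr_hinge / sqrt(n)

all.equal(notch, s$conf)     # the two computations agree
```

If the notches of two boxes do not overlap, this is taken as rough evidence that the medians differ.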

# Generate a basic box plot with outliers present
covid_demographics_total.df %>% 
  # Filter for cumulative data
  filter(period == "cumulative") %>% 
  
  # 1. Data
  ggplot(.) +
    # 2. Aesthetics
    aes(x=age_group, y = percent_cases) +
  
    theme(text = element_text(size = 20)) + # set text size
    theme(legend.position = "none") +
  
    # 4. Geoms
    geom_boxplot(outlier.shape=...,  ### 5.2.0 Remove outliers 
                 notch = ...) +    ### 5.2.0 Add a confidence interval notch
  
    ### 5.2.0 Add datapoints
    ...
## Error in filter(., period == "cumulative"): object 'covid_demographics_total.df' not found

5.3.0 geom_beeswarm takes plotting your points up to the next level

As you can see from above, the geom_jitter() layer adds points to our boxplot by applying a small random displacement to each point, which reduces overplotting. Points are restricted to the width of the boxplot, although this can be adjusted to some degree with the right parameters. geom_jitter() is native to the ggplot2 package, with parameters that control the “random” spread of your data points within a provided area.

The goal of the ggbeeswarm package is to lay out points so that they do not overlap while simultaneously approximating the kernel density of your data. It supplies two geoms that work with the ggplot2 package to accomplish this:

geom_beeswarm() has a number of parameters that can be used to set their aes() mappings but also how the points are laid out.

  • priority: determines the method used to perform the point layout.

    • options include: ascending, descending, density, random, and none.
  • groupOnX: if TRUE then jitter is added along the x axis (default behaviour); otherwise jitter is added along the y axis.

  • dodge.width: the amount by which different aesthetics groups (must be a factor) will be dodged.

  • show.legend: determines whether this layer should be included in the legends.

    • options include: NA (yes if aesthetics are mapped), FALSE (never include), and TRUE (always include)
  • cex: an optional parameter that can be used to help set the spacing between values.

# Generate a basic box plot with outliers present
covid_demographics_total.df %>% 
  # Filter for cumulative data
  filter(period == "cumulative") %>% 

  # 1. Data
  ggplot(.) +
    # 2. Aesthetics
    aes(x=age_group, y = percent_cases) +

    theme(text = element_text(size = 20)) + # set text size
    theme(legend.position = "none") +

    # 4. Geoms
    geom_boxplot(outlier.shape=NA,  # Remove outliers 
                 notch = TRUE) +    # Add a confidence interval notch

    ### 5.3.0 Add datapoints as a beeswarm
    ...
## Error in filter(., period == "cumulative"): object 'covid_demographics_total.df' not found

5.3.1 geom_quasirandom() adds KDE structure to your plotting

As we can see from above, the geom_beeswarm() layer adds a little more structure to the data in a somewhat rigid way. Any datapoints that are near each other are deliberately spaced out to approximate the distribution of your data. Of course, you may run into some issues as your number of datapoints increases or as your data range increases (see the 20 to 39 age group).

To remedy this, we can balance the visualization a bit with the geom_quasirandom() function. geom_quasirandom() works similarly to geom_beeswarm(), with an additional method parameter controlling how the points are plotted:

  • method: determine the method for distributing the points. Options include:
    • quasirandom, pseudorandom: generates a KDE before plotting points in a violin shape
    • smiley, frowney: generates a KDE, then points are binned with maximum bin values plotted on the outside or inside respectively
    • tukey: plotted more as dotstrips in a box-plot style.
  • varwidth: if TRUE, vary the width of each group based on the number of samples in the group

# Generate a basic box plot with outliers present
covid_demographics_total.df %>% 
  # Filter for cumulative data
  filter(period == "cumulative") %>% 
  
  # 1. Data
  ggplot(.) +
    # 2. Aesthetics
    aes(x=age_group, y = percent_cases) +

    theme(text = element_text(size = 20)) + # set text size
    theme(legend.position = "none") +

    # 4. Geoms
    geom_boxplot(outlier.shape=NA,  # Remove outliers 
                 notch = TRUE) +    # Add a confidence interval notch

    ### 5.3.1 Add datapoints as a quasirandom beeswarm
    ...
## Error in filter(., period == "cumulative"): object 'covid_demographics_total.df' not found

Now our data points are laid out with a uniform width that shapes them into a distribution. We’ll see this with even more emphasis in a few sections.

When given the need to show data distribution, try the quasirandom plotting of points over a simple beeswarm.


5.4.0 Add a third dimension to your boxplot with fill

Is there more to the data we’ve visualized? We can add a third dimension in a number of ways, but the simplest would be to compare the proportions of total cases vs total hospitalizations over the course of this pandemic. To do so, we can pivot our dataset to collapse percent_cases and percent_hospitalizations together into a single variable.

From there we’ll use the fill parameter to generate different subgroups in our boxplot to make a grouped boxplot. In doing so, we’ll also have to add the use of the geom_quasirandom() parameters:

  • dodge.width: separate the data points by any aesthetic groups that have been assigned.

  • width: set the maximum spread of your data points within each grouping

  • alpha: set the opacity of your data points so you can see more of the overlapping data.

Lastly, we’ll facet the plot by period so we can, yet again, compare the cumulative data vs a more recent snapshot of the data.
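In our pipeline the collapsing step is tidyr's pivot_longer(); purely to illustrate what that reshape does, here is a base-R equivalent using stats::reshape() on a toy data frame with hypothetical values standing in for our real columns:

```r
# Toy stand-in for the selected columns of covid_demographics_total.df
df <- data.frame(
  age_group                = c("0 to 4", "20 to 39"),
  percent_cases            = c(0.05, 0.40),
  percent_hospitalizations = c(0.02, 0.15)
)

# Equivalent of pivot_longer(cols = c(2:3), names_to = "stat_group",
#                            values_to = "percent_PHU_total")
long <- reshape(df, direction = "long",
                varying = c("percent_cases", "percent_hospitalizations"),
                v.names = "percent_PHU_total",
                timevar = "stat_group",
                times   = c("percent_cases", "percent_hospitalizations"))
long <- long[order(long$age_group), c("age_group", "stat_group", "percent_PHU_total")]
```

Each row of the wide table becomes one row per statistic, which is what lets fill = stat_group split each age group into subgroups.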

covid_demographics_total.df %>% 
  # Select for just the important columns in our analysis
  select(period, public_health_unit, age_group, percent_cases, percent_hospitalizations) %>% 

  # Pivot the modified table to capture the "stat_group" of percent_cases vs percent_hospitalizations
  pivot_longer(cols=c(4:5), names_to = "stat_group", values_to = "percent_PHU_total") %>% 
  
  # Plot the data as a grouped boxplot
  
  # 1. Data
  ggplot(.) +
    # 2. Aesthetics
    aes(x=age_group, y = percent_PHU_total, 
        fill = ...) + ### 5.4.0 Alter our fill aesthetic

    theme(text = element_text(size = 20)) + # set text size

    # 4. Geoms
    geom_boxplot(outlier.shape=NA, notch = TRUE) + # Add the boxplot geom
    # Add a quasirandom distribution of our data points
    geom_quasirandom(dodge.width = ..., width = ..., alpha = 0.5) +

    # 6. Facet
    ### 5.4.0 Produce our data based on period
    facet_wrap(~period, nrow = 2, scales = "free_y")
## Error in select(., period, public_health_unit, age_group, percent_cases, : object 'covid_demographics_total.df' not found

Here we are plotting the data in both boxplot and dotplot formats. The shape of the dots helps give a better sense of the PHU distribution within each age group: you can see that the data points overlay the box but also fall outside it. Can we get the compact nature of the boxplot while still getting the visual appeal of a density plot?


5.5.0 Violin plot - the lovechild of density and boxplots

As the title says, the violin plot is a mixture of both the boxplot and kernel density plot. It’s a miniaturized KDE that is mirrored across its axis. It encompasses the summary information of the boxplot but in a pear- or violin-shaped distribution. To generate a geom_violin() in ggplot, a minimum of three values is required. To justify using a violin, I would again suggest sticking to a similar rule of thumb of a minimum of ~30 samples/observations to ensure an accurate representation of the distribution.

The nuanced visualization of a violin plot gives much more information than the box plot itself, and most boxplots can be replaced with a violin plot. Despite offering this gateway to more detailed distribution information, the format remains less popular and less familiar to scientists, so its immediate accessibility to your audience can be limited.

An important parameter of this geom is scale: this sets how big each violin is in comparison to each other. It accepts the following values:

  • area: default value so all violins will have the same area.

  • count: scale areas proportionately to observations.

  • width: all violins have the same maximum width. Total area is not the same for each.

Outliers and violins: Much like a KDE, the theoretical distribution of a violin plot can generate some impossible values - especially at the tails. Remember that this is a theoretical distribution based on the data supplied. Depending on how much variation is in your data, and how many outliers it has, it can really affect the shape of your violin.
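You can see this tail behaviour numerically in base R: with strictly positive toy data (percentages can never be negative), the default Gaussian KDE still assigns density below zero. A small sketch with made-up data:

```r
set.seed(42)

# Strictly positive toy data standing in for a percentage-like variable
x <- rexp(200, rate = 10)

d <- density(x)   # default Gaussian kernel, extends 3 bandwidths past the data
min(x)            # smallest observation is positive...
min(d$x)          # ...but the estimate's support dips below zero
```

This negative region is exactly the "impossible" tail a violin plot inherits from the KDE.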

covid_demographics_total.df %>% 
  # Filter for only cumulative data
  filter(period == "cumulative") %>% 
  
  # Select for just the important columns in our analysis
  select(period, public_health_unit, age_group, percent_cases, percent_hospitalizations) %>% 

  # Pivot the modified table to capture the "stat_group" of percent_cases vs percent_hospitalizations
  pivot_longer(cols=c(4:5), names_to = "stat_group", values_to = "percent_PHU_total") %>% 
  
  # Plot the data as a grouped boxplot
  
  # 1. Data
  ggplot(.) +
    # 2. Aesthetics
    aes(x=age_group, y = percent_PHU_total, fill = stat_group) +

    theme(text = element_text(size = 20)) + # set text size

    # 3. Scaling
    scale_y_continuous(limits = c(0, 0.6)) + # Set the y-axis limit

    # 4. Geoms
    ... + ## Add the violin geom

    # Add a quasirandom distribution of our data points
    geom_quasirandom(dodge.width = 0.85)
## Error in filter(., period == "cumulative"): object 'covid_demographics_total.df' not found

5.6.0 Violin plots represent distributions and boxplots summarize them

The major advantage of the violin plot is that, by its nature, it is very sensitive to the distribution that produces the density estimate. The boxplot represents the summary information of a distribution, but its five-number summary looks the same whether the underlying data has one mode or several. There are not enough parameters supplied to represent anything more complex!

The violin plot is not limited in that respect. Despite some of its visual caveats, it can certainly detect multi-modal data. Let’s make a toy example to illustrate.

covid_demographics_total.df %>% 
  # Filter for only cumulative data
  filter(period == "cumulative") %>% 

  # Select for just the important columns in our analysis
  select(period, public_health_unit, age_group, percent_cases, percent_hospitalizations) %>% 
  
  # Pivot the modified table to capture the "stat_group" of percent_cases vs percent_hospitalizations
  pivot_longer(cols=c(4:5), names_to = "stat_group", values_to = "percent_PHU_total") %>%  
  
  # Plot a combined violin and boxplot to show difference in distributions
  
  # 1. Data
  ggplot(.) +
    # 2. Aesthetics
    aes(x = percent_PHU_total, y = age_group) +  ### 5.6.0 Swap the x and y axes
    theme_bw()+
    theme(text = element_text(size = 20)) + # set text size

    # 4. Geoms
    ...(scale = "width", colour="darkviolet") + 
    geom_boxplot(alpha = 0.6) + 
    geom_quasirandom(aes(colour = ...))
## Error in filter(., period == "cumulative"): object 'covid_demographics_total.df' not found
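The same point can be made numerically in base R. With a made-up bimodal sample, the five numbers a boxplot draws are blind to the two modes, while the KDE underlying the violin recovers both:

```r
set.seed(1)

# Toy bimodal sample: two well-separated groups
x <- c(rnorm(200, mean = -3), rnorm(200, mean = 3))

fivenum(x)   # the boxplot's five numbers: no hint of two modes

# The KDE a violin is built from shows both peaks
d <- density(x)
peaks <- which(diff(sign(diff(d$y))) == -2) + 1  # indices of local maxima
d$x[peaks]                                        # locations of the modes
```

A boxplot of x would show one box straddling zero; the violin built from d would show two bulges.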

5.7.0 Combine violin and boxplots into the ultimate plot

From our above example, you can see that we blended a number of geoms together. With a little working around, we can also plot both violins and boxplots together in a multivariate setting! This gives us the familiarity of the boxplot but also clearly displays the theoretical distribution. Some steps to accomplish this:

  1. To put emphasis on the violin plots we set the scale parameter to “width”.
  2. We need to adjust some of the boxplot parameters to fit them within the violin plots.
  3. We make some adjustments to the aes() parameters for our geoms to ensure our points are plotted correctly.

covid_demographics_total.df %>% 
  # Filter for only cumulative data
  filter(period == "cumulative") %>% 
  
  # Select for just the important columns in our analysis
  select(period, public_health_unit, age_group, percent_cases, percent_hospitalizations) %>% 
  
  # Pivot the modified table to capture the "stat_group" of percent_cases vs percent_hospitalizations
  pivot_longer(cols=c(4:5), names_to = "stat_group", values_to = "percent_PHU_total") %>%  
  
  # Plot the data as a grouped boxplot
  
  # 1. Data
  ggplot(.) +
    # 2. Aesthetics
    aes(x=age_group, y = percent_PHU_total) +

    theme(text = element_text(size = 20)) + # set text size

    # 3. Scaling
    scale_colour_manual(values=c("black", "black"))+ # we'll need this to fix our boxplots

    # 4. Data
    # multi-factor violin plots but keep the width consistent
    geom_violin(scale="width", aes(fill=...)) + 

    # Boxplot but smaller width so they reside "within" the violin plot
    geom_boxplot(aes(colour = ...), width=0.2, 
                 position = ..., 
                 outlier.shape=NA) + # Remove the outliers

    ### 5.7.0 Add in all of the data points
    geom_quasirandom(dodge.width = 0.85, aes(group=stat_group), alpha = 0.8)
## Error in filter(., period == "cumulative"): object 'covid_demographics_total.df' not found

Section 5.7.0 Comprehension Question: Looking at the above code for our graph, why do you think we set the aes(fill/group = stat_group) aesthetic individually for some layers rather than directly in the aes() layer?


5.8.0 Parallel coordinate plots can help visualize multivariate data

While the above visualization shows the total distribution of our age groups with some clarity, the messiness of outliers has invaded some of the violin plots themselves. An experienced eye, however, can still see that the bulk of the population centres around our internal boxplots, offset by the stretched-out look of our violins. We could, of course, clean it up by removing some of the smaller PHU populations or removing outliers ahead of time.

Suppose, however, we wanted to add more levels to our data, like percent_male_cases and percent_female_cases? With 7 age groups represented, things would start to get very crowded. To accommodate all of that data, you could facet it into 4 panels based on the statistic used, but this would separate the data, which reads best when you can see it all on a single plot.

Organizing multiple groups across multiple categories is the domain of the parallel coordinate plot. The GGally package is a ggplot2 extension whose ggparcoord() function allows us to look simultaneously at our four indicators (percent cases, hospitalizations, male cases, and female cases) for each age group, linking each PHU. For each PHU at each indicator, we will draw a line connecting all values across the age groups (coordinates). We will colour our lines based on the category of the indicator.

Although GGally is an extension for ggplot2, we actually have to wrestle with our data a little more and put it back into a wider format. Otherwise, once the plot is made, we can treat it similarly by changing themes, etc.

The parameters for ggparcoord() require:

  • columns: a numeric vector of the location of columns that will represent the coordinates (categories/groups/variables).

  • groupColumn: the column that defines the indicator type (stat_group) for the observations.

  • scale: each coordinate should theoretically have its own scale, but that isn’t always possible depending on the data. Instead, the default method of scaling is to normalize points in terms of the standard deviation of the data along that coordinate. You can, however, choose to keep the original scale with globalminmax if the ranges of columns aren’t too disparate.
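The default scale = "std" is just column-wise standardization, which base R's scale() performs. A sketch with hypothetical values showing why it puts disparate columns on one shared axis:

```r
# Two toy coordinates with very different ranges (hypothetical values)
df <- data.frame(`0 to 4`   = c(0.02, 0.05, 0.08),
                 `20 to 39` = c(0.30, 0.45, 0.60),
                 check.names = FALSE)

# Column-wise (x - mean) / sd, i.e. the "std" scaling
std <- as.data.frame(scale(df))

colMeans(std)      # each coordinate is centred at 0...
sapply(std, sd)    # ...with standard deviation 1
```

After this transformation, every coordinate occupies the same vertical range, so lines can be drawn across columns without one column's scale dominating.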

To generate this visualization, we’ll want to:

1. Remake our list of PHUs ordered by descending case load.

2. Pivot our data to wide format, giving each age group its own column in our data

# Generate the ordered list of PHUs by total cases
phu_by_cases <- phu_information.df %>% 
  slice(2:n()) %>% # drop the Ontario data
  select(1, 2) %>% # Only take geographic_area and case_count
  arrange(desc(case_count)) %>% # sort by descending order
  select(1) %>% # grab just the PHU names
  unlist() %>% # Unlist
  as.character() # Convert to a character vector

# Look at the ordered list of PHUs
phu_by_cases
##  [1] "Toronto"                                   
##  [2] "Peel"                                      
##  [3] "York Region"                               
##  [4] "Ottawa"                                    
##  [5] "Durham Region"                             
##  [6] "City of Hamilton"                          
##  [7] "Halton Region"                             
##  [8] "Region of Waterloo"                        
##  [9] "Windsor-Essex County"                      
## [10] "Simcoe Muskoka"                            
## [11] "Niagara Region"                            
## [12] "Middlesex-London"                          
## [13] "Wellington-Dufferin-Guelph"                
## [14] "Eastern Ontario"                           
## [15] "Sudbury & Districts"                       
## [16] "Southwestern"                              
## [17] "Kingston, Frontenac and Lennox & Addington"
## [18] "Brant County"                              
## [19] "Lambton"                                   
## [20] "Thunder Bay"                               
## [21] "Haliburton, Kawartha, Pine Ridge"          
## [22] "Haldimand-Norfolk"                         
## [23] "Hastings Prince Edward"                    
## [24] "Chatham-Kent"                              
## [25] "Leeds, Grenville & Lanark"                 
## [26] "Grey Bruce"                                
## [27] "Huron Perth"                               
## [28] "Peterborough"                              
## [29] "Porcupine"                                 
## [30] "Algoma"                                    
## [31] "Northwestern"                              
## [32] "North Bay Parry Sound"                     
## [33] "Renfrew County"                            
## [34] "Timiskaming"
covid_pcp.df <- covid_demographics_total.df %>% 

  # Filter for the top 15 PHUs by case load
  filter(public_health_unit %in% phu_by_cases[1:15], 
         period == "cumulative"
        ) %>% 
  
  # Select for just the important columns
  select(period, public_health_unit, age_group, percent_cases, percent_hospitalizations, 
         percent_male_cases, percent_female_cases) %>% 
  
  # Pivot the modified table to capture the "stat_group" of percent_cases
  # percent_hospitalizations, and percent_x_cases
  pivot_longer(cols=c(4:7), names_to = "stat_group", values_to = "percent_PHU_total") %>% 
  
  # We're going to reorder the factor ahead of time
  # Probably a better way to do this but requiring more code
  mutate(stat_group = factor(stat_group, levels=c("percent_hospitalizations", "percent_cases", 
                                                  "percent_male_cases", "percent_female_cases"))) %>% 
  
  # Now we're going to pivot out wide to give each age_group it's own column
  pivot_wider(names_from = age_group, values_from = c(percent_PHU_total))  %>% 
  
  # Fix the positioning of our factor levels
  relocate('5 to 11', .after = '0 to 4')
## Error in filter(., public_health_unit %in% phu_by_cases[1:15], period == : object 'covid_demographics_total.df' not found
head(covid_pcp.df)
## Error in head(covid_pcp.df): object 'covid_pcp.df' not found
# Generate a ggplot object
# 1. Data and Geoms
ggparcoord(..., 
           columns=..., 
           groupColumn=...,
           showPoints = ..., 
           scale=...) +

  # 2. Aesthetics
  theme_bw() +
  theme(text = element_text(size = 20)) + # set text size
  xlab("Age group") +
  ylab("Percent of category by individual PHU") +   
  guides(colour = guide_legend(title="Category"))
## Error in ggparcoord(..., columns = ..., groupColumn = ..., showPoints = ..., : could not find function "ggparcoord"

5.8.1 group your data by multiple variables with the interaction() function

Sometimes in your data you may want specific parts of your data to be grouped together, even if you aren’t planning on using that information directly to determine a specific aesthetic category like size, colour or fill.

For example, our parallel coordinate plot from above could be reproduced directly in ggplot. As long as we have our data in the proper long format, we can use specific variables to ensure our data is displayed properly by using the group aesthetic.

In the case of our data, we’d like to group our data lines by both their age_group and their data category (i.e., percent_cases, percent_hospitalizations, etc.). In our wrangled dataframe the category will fall under the stat_group variable after pivoting longer.

The problem with this strategy, however, is that group, like most aesthetics, does not normally accept more than one variable. We can solve this with the interaction() function, which allows us to name multiple factors that will be combined to formally categorize observations as a composite of those factors.
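A quick base-R look at what interaction() actually returns, using made-up factors that mirror our variables:

```r
# Two factors we want to treat as one composite grouping
age_group  <- factor(c("0 to 4", "0 to 4", "20 to 39", "20 to 39"))
stat_group <- factor(c("percent_cases", "percent_hospitalizations",
                       "percent_cases", "percent_hospitalizations"))

# One composite factor with a level for every combination
grp <- interaction(age_group, stat_group)

nlevels(grp)   # 2 age groups x 2 statistics = 4 composite levels
levels(grp)    # e.g. "0 to 4.percent_cases"
```

Mapping the result to the group aesthetic then yields one line (or group) per composite level.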

With that in mind, let’s mock up a quick parallel coordinate plot of our own!

covid_demographics_total.df %>% 
  # Filter for only cumulative data and reduce the PHU data to 15 groups
  filter(public_health_unit %in% phu_by_cases[1:15], 
         period == "cumulative"
        ) %>% 
  
  # Select for just the important columns in our analysis
  select(period, public_health_unit, age_group, percent_cases, percent_hospitalizations, 
         percent_male_cases, percent_female_cases) %>% 
  
  # Pivot the modified table to capture the "stat_group" of percent_cases vs percent_hospitalizations
  pivot_longer(cols=c(4:7), names_to = "stat_group", values_to = "percent_PHU_total") %>%  
  
  ggplot() + 
    # 2. Aesthetics
    aes(x = age_group, y = percent_PHU_total) +

    # Set a basic theme
    theme_bw()+
    theme(text = element_text(size = 20)) + # set text size

    # 4. Geoms
    ### 5.8.1 A line plot to join the points
    geom_line(aes(group = ..., 
                  colour = stat_group)) +

    ### 5.8.1 Add the points in also to help follow the connections
    geom_point(aes(colour = stat_group))
## Error in filter(., public_health_unit %in% phu_by_cases[1:15], period == : object 'covid_demographics_total.df' not found

Looks nearly the same AND you don’t have to remember how to properly format your data for ggparcoord()!

6.0.0 Class summary

We’ve covered a number of key plots today, including when and how to use them. Next week we’ll revisit some of these plots and spruce them up with extra touches that will take them that extra distance. Below you’ll find a summary of what we’ve discussed today.

6.1.0 Summary of plots

  • Scatterplot
    • Key features: Good for exploring relationships between variables
    • Notes: Bubbleplots add an extra dimension to your data

  • Barplot
    • Key features: Present values across groups; stack categories for an extra dimension
    • Notes: Presenting proportions, small sample sizes; does not dissect individual distributions

  • Nightingale plot
    • Key features: Circular-wedge barplot, same properties as barplot
    • Notes: Presenting data over unordered groups; visual emphasis on outer area size may mislead the reader

  • Racetrack plot
    • Key features: Circular-ringed barplot, same properties as barplot
    • Notes: A more compact way to show barplots; calculate length by radians as outer rings are “stretched”

  • Density plot
    • Key features: Theoretical distribution of your sample data; can plot up to 5 distributions on the same axis
    • Notes: Minimum sample size 4, but 30 is more reliable; tails can produce “ghost” data

  • Ridgeline plot
    • Key features: Tighter visualization of multiple densities; same properties as KDEs
    • Notes: Good way to pack more KDEs into a smaller area; no real control of outliers; similar “ghost” data issues

  • Box and whisker plot
    • Key features: Summarizes distributions with 5 parameters
    • Notes: Popular and compact presentation of simple populations; minimum sample size = 5; does not properly visualize multi-modal data

  • Violin plot
    • Key features: Boxplot format with a KDE violin shape; DOES represent multi-modal populations
    • Notes: Compact representation of distribution shape; less popular, with nuanced interpretation; inherits “ghost” data and other properties of the KDE

  • Parallel coordinate plot
    • Key features: Visual representation of multivariate data; related data can be connected linearly
    • Notes: Connects trends across groups; not limited by number of samples; can help identify trends within multicategorical data

6.2.0 Weekly assignment

This week’s assignment will be found under the current lecture folder under the “assignment” subfolder. It will include an R markdown notebook that you will use to produce the code and answers for this week’s assignment. Please provide answers in markdown or code cells that immediately follow each question section.

Assignment breakdown

  • Code (50%)
    • Does it follow best practices?
    • Does it make good use of available packages?
    • Was data prepared properly?

  • Answers and Output (50%)
    • Is output based on the correct dataset?
    • Are groupings appropriate?
    • Are titles/axes/legends correct?
    • Is interpretation of the graphs correct?

Since coding styles and solutions can differ, students are encouraged to use best practices. Well-coded or elegant solutions may be rewarded.

You can save and download the markdown notebook in its native format. Submit this file to the appropriate assignment section by 12:59 pm on the date of our next class: March 28th, 2024.


6.3.0 Acknowledgements

Revision 1.0.0: created and prepared for CSB1021H S LEC0141, 03-2021 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.

Revision 1.0.1: edited and prepared for CSB1020H S LEC0141, 03-2022 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.

Revision 1.0.2: edited and prepared for CSB1020H S LEC0141, 03-2023 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.

Revision 2.0.0: Revised and prepared for CSB1020H S LEC0141, 03-2024 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.


The Centre for the Analysis of Genome Evolution and Function (CAGEF)

The Centre for the Analysis of Genome Evolution and Function (CAGEF) at the University of Toronto offers comprehensive experimental design, research, and analysis services in microbiome and metagenomic studies, genomics, proteomics, and bioinformatics.

From targeted DNA amplicon sequencing to transcriptomes, whole genomes, and metagenomes, from protein identification to post-translational modification, CAGEF has the tools and knowledge to support your research. Our state-of-the-art facility and experienced research staff provide a broad range of services, including both standard analyses and techniques developed by our team. In particular, we have special expertise in microbial, plant, and environmental systems.

For more information about us and the services we offer, please visit https://www.cagef.utoronto.ca/.